1 Introduction

The dataset for this project is Australia’s data science job listings for August 2022, which can be downloaded from Kaggle. It was scraped from Glassdoor, a website where current and former employees anonymously review companies.

2 Setup

Clean workspace and set random seed.

rm(list = ls())
set.seed(1)

Code to allow truncating long text output, especially from the str() and head() functions.

# Save the built-in output hook
hook_output <- knitr::knit_hooks$get("output")

# Set a new output hook to truncate text output
knitr::knit_hooks$set(output = function(x, options) {
  if (!is.null(n <- options$out.lines)) {
    x <- xfun::split_lines(x)
    if (length(x) > n) {
      # Truncate the output
      x <- c(head(x, n), "....\n")
    }
    x <- paste(x, collapse = "\n")
  }
  hook_output(x, options)
})
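
For example, a chunk can opt into truncation by setting the custom out.lines option in its chunk header; the header below is illustrative only (the option name must match the one read by the hook above).

```{r, out.lines = 10}
str(df)
```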

Load libraries and data.

library(ggplot2)
library(WVPlots)
library(ozmaps)
library(sf)
library(dplyr)
library(gridExtra)
library(knitr)
library(ROCR)
library(ROCit)
library(caret)
library(lime)
library(grDevices)
library(tidyverse)
library(fpc)
library(klaR)

df <- read.csv(file = "AustraliaDataScienceJobs.csv", header = TRUE)

# Replace empty strings with NA
df[df == ""] <- NA

Set up a custom theme so all the plots are standardised, while still letting each plot customise it to its needs.

custom_theme <- function(
  title_y_r_margin = 10,
  title_x_t_margin = 10,
  title_b_margin = 20,
  title_size = 20,
  label_size = 16,
  font_size = 14
) {
  theme(
    axis.title.x = element_text(margin = margin(t = title_x_t_margin)),
    axis.title.y = element_text(margin = margin(r = title_y_r_margin)),
    axis.title = element_text(size = label_size),
    plot.title = element_text(
      size = title_size,
      margin = margin(b = title_b_margin)
    ),
    text = element_text(size = font_size)
  )
}

3 Data Cleaning and Transformation

3.1 Data Exploration and Cleaning

Use str() to have a glance at the data.

str(df)
## 'data.frame':    2088 obs. of  53 variables:
##  $ Job.Title                    : chr  "Analyst" "Clinical Research Associate" "Clinical Research Associate" "Clinical Research Associate" ...
##  $ Job.Location                 : chr  "Melbourne" "Mulgrave" "Mulgrave" "Mulgrave" ...
##  $ Company                      : chr  "ANZ Banking Group" "Bristol Myers Squibb" "Bristol Myers Squibb" "Bristol Myers Squibb" ...
##  $ Url                          : chr  "https://www.glassdoor.com.au/partner/jobListing.htm?pos=101&ao=1110586&s=58&guid=000001826eaa3c8799de46d597a14f"| __truncated__ "https://www.glassdoor.com.au/partner/jobListing.htm?pos=201&ao=1110586&s=58&guid=000001826eaa43e3934cf1dcd2d0c4"| __truncated__ "https://www.glassdoor.com.au/partner/jobListing.htm?pos=301&ao=1110586&s=58&guid=000001826eaa4c5285b0cd598cb71a"| __truncated__ "https://www.glassdoor.com.au/partner/jobListing.htm?pos=301&ao=1110586&s=58&guid=000001826eaa551389362103f4b33f"| __truncated__ ...
##  $ Estimate.Base.Salary         : int  95917 96555 96555 96555 115631 115631 62718 115631 212000 212000 ...
##  $ Low.Estimate                 : int  80000 79000 79000 79000 94000 94000 62000 94000 173000 173000 ...
##  $ High.Estimate                : int  115000 118000 118000 118000 143000 143000 63000 143000 251000 251000 ...
##  $ Company.Size                 : chr  "10000+ Employees" "10000+ Employees" "10000+ Employees" "10000+ Employees" ...
##  $ Company.Type                 : chr  "Company - Public" "Company - Public" "Company - Public" "Company - Public" ...
##  $ Company.Sector               : chr  "Finance" "Pharmaceutical & Biotechnology" "Pharmaceutical & Biotechnology" "Pharmaceutical & Biotechnology" ...
##  $ Company.Founded              : num  1835 1858 1858 1858 1835 ...
##  $ Company.Industry             : chr  "Banking & Lending" "Biotech & Pharmaceuticals" "Biotech & Pharmaceuticals" "Biotech & Pharmaceuticals" ...
##  $ Company.Revenue              : chr  "Unknown / Non-Applicable" "$10+ billion (USD)" "$10+ billion (USD)" "$10+ billion (USD)" ...
##  $ Job.Descriptions             : chr  "-\nBuild a rewarding career in an innovative and collaborative environment\nHighly engaged culture aiming to su"| __truncated__ "At Bristol Myers Squibb, we are inspired by a single vision – transforming patients’ lives through science. In "| __truncated__ "At Bristol Myers Squibb, we are inspired by a single vision – transforming patients’ lives through science. In "| __truncated__ "At Bristol Myers Squibb, we are inspired by a single vision – transforming patients’ lives through science. In "| __truncated__ ...
##  $ Company.Rating               : num  3.7 3.9 3.9 3.9 3.9 4.1 4.2 4.1 4.4 4.4 ...
##  $ Company.Friend.Reccomendation: num  71 76 76 76 80 85 93 85 90 90 ...
##  $ Company.CEO.Approval         : num  75 82 82 82 83 86 77 86 94 94 ...
##  $ Companny.Number.of.Rater     : num  354 1165 1165 1165 2339 ...
##  $ Company.Career.Opportinities : num  3.9 3.6 3.6 3.6 3.9 3.9 4 3.9 4.1 4.1 ...
##  $ Compensation.and.Benefits    : num  3.7 3.9 3.9 3.9 3.7 3.7 3.6 3.7 4.3 4.3 ...
##  $ Company.Culture.and.Values   : num  4.1 3.8 3.8 3.8 4.1 4.1 4.1 4.1 4.5 4.5 ...
##  $ Company.Senior.Management    : num  3.7 3.5 3.5 3.5 3.7 3.7 4 3.7 4.1 4.1 ...
##  $ Company.Work.Life.Balance    : num  4 3.7 3.7 3.7 4 4 4 4 4.5 4.5 ...
##  $ Country                      : chr  "Australia" "Australia" "Australia" "Australia" ...
##  $ State                        : chr  "Victoria" "Victoria" "Victoria" "Victoria" ...
##  $ python_yn                    : int  0 0 0 0 1 1 0 1 1 1 ...
##  $ r_yn                         : int  0 0 0 0 1 1 0 1 0 0 ...
##  $ sql_yn                       : int  0 0 0 0 1 1 0 1 0 0 ...
##  $ java_yn                      : int  0 0 0 0 0 0 0 0 1 1 ...
....

There are 2088 observations of 53 variables.

3.2 Renaming Variables

There are a couple of typos in the variable names; let’s fix those first. Some names are also quite verbose and can be simplified.

df <- rename(df,
  Number.of.Rater = Companny.Number.of.Rater,
  Career.Opportunities = Company.Career.Opportinities,
  Culture.and.Values = Company.Culture.and.Values,
  Senior.Management = Company.Senior.Management,
  Work.Life.Balance = Company.Work.Life.Balance,
  Friend.Recommendation = Company.Friend.Reccomendation
)

3.3 Dropping Columns

The variable Country should be all Australia: since this is a dataset of Australian data science jobs, every listing should be in Australia, otherwise it is invalid.

any(df$Country != "Australia")
## [1] FALSE

This confirms there is no invalid data in the Country variable, but it also means the column is essentially useless, as every observation has the same value, so we can drop it. We can also drop columns that are just long chunks of text and not useful to us, namely Job Descriptions and Url.

df <- df[!names(df) %in% c("Country", "Job.Descriptions", "Url")]

3.4 Type Conversion

All the variables ending with _yn contain only 1s and 0s, which should be converted to logical.

# Anchor the pattern so only names ending in _yn match
skill_columns <- grep("_yn$", names(df))
df[skill_columns] <- lapply(df[skill_columns], as.logical)

Then inspect the data again using head() to look at the first 3 observations.

head(df, 3)
##                     Job.Title Job.Location              Company
## 1                     Analyst    Melbourne    ANZ Banking Group
## 2 Clinical Research Associate     Mulgrave Bristol Myers Squibb
## 3 Clinical Research Associate     Mulgrave Bristol Myers Squibb
##   Estimate.Base.Salary Low.Estimate High.Estimate     Company.Size
## 1                95917        80000        115000 10000+ Employees
## 2                96555        79000        118000 10000+ Employees
## 3                96555        79000        118000 10000+ Employees
##       Company.Type                 Company.Sector Company.Founded
## 1 Company - Public                        Finance            1835
## 2 Company - Public Pharmaceutical & Biotechnology            1858
## 3 Company - Public Pharmaceutical & Biotechnology            1858
##            Company.Industry          Company.Revenue Company.Rating
## 1         Banking & Lending Unknown / Non-Applicable            3.7
## 2 Biotech & Pharmaceuticals       $10+ billion (USD)            3.9
## 3 Biotech & Pharmaceuticals       $10+ billion (USD)            3.9
##   Friend.Recommendation Company.CEO.Approval Number.of.Rater
## 1                    71                   75             354
## 2                    76                   82            1165
## 3                    76                   82            1165
##   Career.Opportunities Compensation.and.Benefits Culture.and.Values
## 1                  3.9                       3.7                4.1
## 2                  3.6                       3.9                3.8
## 3                  3.6                       3.9                3.8
##   Senior.Management Work.Life.Balance    State python_yn  r_yn sql_yn java_yn
## 1               3.7               4.0 Victoria     FALSE FALSE  FALSE   FALSE
## 2               3.5               3.7 Victoria     FALSE FALSE  FALSE   FALSE
## 3               3.5               3.7 Victoria     FALSE FALSE  FALSE   FALSE
....

Looking much cleaner now.

3.5 Missing Values

Count the missing values in each variable.

count_missing <- function() {
  na_counts <- sapply(df, function(col) length(which(is.na(col))))
  na_counts[which(na_counts > 0)]
}
count_missing()
##                 Job.Title              Company.Size              Company.Type 
##                         3                       181                       181 
##            Company.Sector           Company.Founded          Company.Industry 
##                       567                       890                       567 
##           Company.Revenue            Company.Rating     Friend.Recommendation 
##                       181                       311                       356 
##      Company.CEO.Approval           Number.of.Rater      Career.Opportunities 
##                       764                       311                       318 
## Compensation.and.Benefits        Culture.and.Values         Senior.Management 
##                       318                       318                       318 
##         Work.Life.Balance 
##                       318

The 3 jobs that do not even have a Title should be dropped.

df <- df[!is.na(df$Job.Title), ]
count_missing()
##              Company.Size              Company.Type            Company.Sector 
##                       180                       180                       566 
##           Company.Founded          Company.Industry           Company.Revenue 
##                       889                       566                       180 
##            Company.Rating     Friend.Recommendation      Company.CEO.Approval 
##                       310                       355                       763 
##           Number.of.Rater      Career.Opportunities Compensation.and.Benefits 
##                       310                       317                       317 
##        Culture.and.Values         Senior.Management         Work.Life.Balance 
##                       317                       317                       317

The variable Company Founded is the year the company was founded, and Company CEO Approval is the percentage of employees who approve of their CEO. Both have around 40% missing values; the former probably because companies did not enter their founding year, and the latter perhaps because employees are reluctant to comment on their CEO. With this much missing data, neither carries enough useful information to be kept.

df <- df[!names(df) %in% c("Company.Founded", "Company.CEO.Approval")]
count_missing()
##              Company.Size              Company.Type            Company.Sector 
##                       180                       180                       566 
##          Company.Industry           Company.Revenue            Company.Rating 
##                       566                       180                       310 
##     Friend.Recommendation           Number.of.Rater      Career.Opportunities 
##                       355                       310                       317 
## Compensation.and.Benefits        Culture.and.Values         Senior.Management 
##                       317                       317                       317 
##         Work.Life.Balance 
##                       317

All 3 variables Company Size, Company Type and Company Revenue are missing 180 values; let’s see if they are missing from the same observations.

summary(cbind(
  Company.Size = which(is.na(df$Company.Size)),
  Company.Type = which(is.na(df$Company.Type)),
  Company.Revenue = which(is.na(df$Company.Revenue))
))
##   Company.Size     Company.Type    Company.Revenue 
##  Min.   :  21.0   Min.   :  21.0   Min.   :  21.0  
##  1st Qu.: 827.8   1st Qu.: 827.8   1st Qu.: 827.8  
##  Median :1105.5   Median :1105.5   Median :1105.5  
##  Mean   :1151.4   Mean   :1151.4   Mean   :1151.4  
##  3rd Qu.:1554.8   3rd Qu.:1554.8   3rd Qu.:1554.8  
##  Max.   :2081.0   Max.   :2081.0   Max.   :2081.0

The missing values in all 3 columns appear to be in the exact same locations. As these rows make up almost 10% of the observations, let’s see if they are also missing the rating data.
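
A stricter check than comparing summary statistics (a sketch on the same columns) is to compare the NA index vectors directly:

identical(which(is.na(df$Company.Size)), which(is.na(df$Company.Type))) &&
  identical(which(is.na(df$Company.Size)), which(is.na(df$Company.Revenue)))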

rating_columns <- c(
  "Company.Rating",
  "Career.Opportunities",
  "Compensation.and.Benefits",
  "Culture.and.Values",
  "Senior.Management",
  "Work.Life.Balance",
  "Friend.Recommendation"
)
any(!is.na(df[is.na(df$Company.Size), rating_columns]))
## [1] FALSE

It looks like all 180 jobs are missing the rating data as well, so we can safely remove them.

df <- df[!is.na(df$Company.Size), ]
count_missing()
##            Company.Sector          Company.Industry            Company.Rating 
##                       386                       386                       130 
##     Friend.Recommendation           Number.of.Rater      Career.Opportunities 
##                       175                       130                       137 
## Compensation.and.Benefits        Culture.and.Values         Senior.Management 
##                       137                       137                       137 
##         Work.Life.Balance 
##                       137

There are still 130 missing values in both Company Rating and Number of Rater; again, check whether they come from the same observations.

any(is.na(df$Company.Rating) != is.na(df$Number.of.Rater))
## [1] FALSE

Not a single NA in Company Rating differs from those in Number of Rater, which means they do come from the same observations. This makes sense: if there are no raters, there cannot be any ratings. The missing values in Number of Rater can therefore be replaced with 0, indicating that no one rated, while the missing ratings will be approximated with the mean rating.

df$Number.of.Rater[is.na(df$Number.of.Rater)] <- 0

Have a peek at the summary of the rating columns.

summary(df[rating_columns])
##  Company.Rating Career.Opportunities Compensation.and.Benefits
##  Min.   :1.60   Min.   :1.000        Min.   :1.000            
##  1st Qu.:3.70   1st Qu.:3.400        1st Qu.:3.400            
##  Median :3.90   Median :3.700        Median :3.600            
##  Mean   :3.88   Mean   :3.688        Mean   :3.607            
##  3rd Qu.:4.10   3rd Qu.:4.000        3rd Qu.:3.900            
##  Max.   :5.00   Max.   :5.000        Max.   :5.000            
##  NA's   :130    NA's   :137          NA's   :137              
##  Culture.and.Values Senior.Management Work.Life.Balance Friend.Recommendation
##  Min.   :1.000      Min.   :1.000     Min.   :1.000     Min.   : 15.00       
##  1st Qu.:3.500      1st Qu.:3.100     1st Qu.:3.300     1st Qu.: 68.00       
##  Median :3.900      Median :3.500     Median :3.800     Median : 79.00       
##  Mean   :3.862      Mean   :3.464     Mean   :3.788     Mean   : 76.39       
##  3rd Qu.:4.100      3rd Qu.:3.700     3rd Qu.:4.100     3rd Qu.: 86.00       
##  Max.   :5.000      Max.   :5.000     Max.   :5.000     Max.   :100.00       
##  NA's   :137        NA's   :137       NA's   :137       NA's   :175

Since Company Rating is just a number out of 5, we can replace its missing values with the mean, and similarly for Career Opportunities, Compensation and Benefits, Culture and Values, Senior Management and Work Life Balance. Friend Recommendation is simply another rating, out of 100 instead of 5, so the same treatment applies to all of them.

df[rating_columns] <- sapply(df[rating_columns], function(column) {
  na <- is.na(column)
  # Replace missing ratings with the mean of the observed values
  column[na] <- mean(column[!na])
  column
})
count_missing()
##   Company.Sector Company.Industry 
##              386              386

The 2 remaining variables both have 386 missing values; check one last time whether they come from the same observations.

any(is.na(df$Company.Sector) != is.na(df$Company.Industry))
## [1] FALSE

Same result as before, and since they are categorical, their missing values can simply be replaced with a new category, Unknown.

industry_column <- c("Company.Sector", "Company.Industry")
df[industry_column] <- sapply(df[industry_column], function(column) {
  column[is.na(column)] <- "Unknown"
  column
})
count_missing()
## named integer(0)

Finally there are no more missing values.

4 Visualisation

4.1 Job Locations

df_locations <- as.data.frame(table(df$Job.Location))
colnames(df_locations) <- c("Location", "Number.of.Jobs")

ggplot(
  df_locations[df_locations$Number.of.Jobs > 10, ],
  aes(
    x = reorder(Location, Number.of.Jobs),
    y = Number.of.Jobs
  )
) +
  geom_bar(
    stat = "identity",
    width = 0.6,
    fill = "darkcyan"
  ) +
  geom_text(
    aes(label = Number.of.Jobs),
    hjust = 0
  ) +
  labs(
    x = "Job Location",
    title = "Locations with the highest number of jobs"
  ) +
  annotate("text", x = 12, y = 200, label = "This is invalid") +
  annotate("segment", x = 12, y = 140, xend = 12, yend = 80, arrow = arrow(
    type = "closed", length = unit(0.02, "npc")
  )) +
  coord_flip() +
  custom_theme(
    title_b_margin = 10,
    title_size = 14,
    label_size = 12,
    font_size = 11
  )

The bar chart shows the locations in decreasing order of the number of job openings, keeping only locations with more than 10 jobs.

The 6th location, Australia, should be changed to Unknown, as all the other locations are cities rather than countries. This category probably appears either because a job spans multiple locations within Australia or because the exact location is unknown.

df$Job.Location[df$Job.Location == "Australia"] <- "Unknown"

Let’s look at the distribution in terms of states.

ClevelandDotPlot(
  df,
  "State",
  sort = 1,
  title = "Jobs Distribution by State"
) +
  coord_flip() +
  custom_theme(title_y_r_margin = 20, title_x_t_margin = 5)

The first 5 locations are all the capital cities of their respective states, so the state ranking follows the same order. We can plot the jobs on a map of the Australian states to see the distribution more clearly.

sf_oz <- ozmap_data("states")

jobs <- as.data.frame(table(df$State))
colnames(jobs) <- c("NAME", "Jobs")

ggplot(merge(sf_oz, jobs, all.x = TRUE)) +
  geom_sf(aes(fill = Jobs)) +
  labs(
    x = "Longitude",
    y = "Latitude"
  ) +
  scale_fill_gradient(low = "purple", high = "lightpink") +
  custom_theme()

4.2 Company

df_companies <- as.data.frame(table(df$Company))
colnames(df_companies) <- c("Company", "Number.of.Jobs")

df_industries <- df[!duplicated(df$Company), c("Company", "Company.Industry")]
df_merge <- merge(df_companies, df_industries)

ggplot(
  head(df_merge[order(-df_merge$Number.of.Jobs), ], 10),
  aes(
    x = reorder(Company, -Number.of.Jobs),
    y = Number.of.Jobs,
    fill = Company.Industry
  )
) +
  geom_bar(
    stat = "identity",
  ) +
  geom_text(
    aes(label = Number.of.Jobs),
    vjust = 1.5,
    colour = "white"
  ) +
  labs(
    x = "Company",
    title = "Top 10 companies in number of job offerings"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  custom_theme(
    title_b_margin = 10,
    title_size = 14,
    label_size = 12,
    font_size = 10
  )

Deloitte, in the Accounting & Tax industry, has the most job openings, almost double that of the second-ranked company CSIRO, a National Services & Agencies organisation; the rest are quite close together. This is probably because Accounting & Tax firms need more data scientists, as they deal with numbers the most.

Convert Company Size and Company Revenue to factors so their categories can be ordered manually.

df$Company.Size <- factor(
  df$Company.Size,
  levels = c(
    "Unknown",
    "1 to 50 Employees",
    "51 to 200 Employees",
    "201 to 500 Employees",
    "501 to 1000 Employees",
    "1001 to 5000 Employees",
    "5001 to 10000 Employees",
    "10000+ Employees"
  )
)
df$Company.Revenue <- factor(
  df$Company.Revenue,
  levels = c(
    "Unknown / Non-Applicable",
    "Less than $1 million (USD)",
    "$1 to $5 million (USD)",
    "$5 to $10 million (USD)",
    "$10 to $25 million (USD)",
    "$25 to $50 million (USD)",
    "$50 to $100 million (USD)",
    "$100 to $500 million (USD)",
    "$500 million to $1 billion (USD)",
    "$1 to $2 billion (USD)",
    "$2 to $5 billion (USD)",
    "$5 to $10 billion (USD)",
    "$10+ billion (USD)"
  )
)
summary(df[c("Company.Size", "Company.Revenue")])
##                   Company.Size                         Company.Revenue
##  10000+ Employees       :538   Unknown / Non-Applicable        :699   
##  1001 to 5000 Employees :372   $10+ billion (USD)              :400   
##  5001 to 10000 Employees:300   $2 to $5 billion (USD)          :177   
##  1 to 50 Employees      :189   $500 million to $1 billion (USD):134   
##  501 to 1000 Employees  :170   $5 to $10 billion (USD)         :113   
##  Unknown                :153   $100 to $500 million (USD)      : 95   
##  (Other)                :183   (Other)                         :287

Let’s see if there is a relationship between Company Size and Company Revenue by plotting a stacked bar chart, excluding the Unknown / Non-Applicable categories.

ggplot(
  df[df$Company.Revenue != "Unknown / Non-Applicable" &
    !duplicated(df$Company), ]
) +
  geom_bar(
    aes(x = Company.Revenue, fill = Company.Size),
    alpha = 0.5
  ) +
  labs(y = "Number of Companies") +
  theme(
    axis.text.x = element_text(angle = 75, hjust = 1.05),
    axis.title = element_text(size = 16)
  ) +
  custom_theme(title_x_t_margin = 15, font_size = 10, label_size = 12)

In general, the bigger the company, the higher its revenue. A few observations illustrate this: small companies with fewer than 50 employees have revenues below $25 million, medium-sized companies with 501 to 1000 employees sit between $50 million and $2 billion, and most large companies with more than 10000 employees have revenues over $10 billion.

Another observation is that there are more companies in the higher revenue brackets, which may be because bigger companies want to keep growing and are therefore more likely to have open positions.
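
The same relationship can be inspected numerically with a cross-tabulation over unique companies (a sketch, excluding the unknown revenue category as in the plot):

with(
  df[df$Company.Revenue != "Unknown / Non-Applicable" & !duplicated(df$Company), ],
  table(Company.Size, Company.Revenue)
)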

ggplot(df[df$Company.Sector != "Unknown", ]) +
  geom_count(aes(x = Company.Sector, y = Company.Type, colour = ..n..)) +
  theme(axis.text.x = element_text(angle = 75, hjust = 1)) +
  scale_color_gradient(low = "brown", high = "magenta")

Plotting Company Type against Company Sector, some combinations make perfect sense: all College / University employers are in Education and all Hospitals are in Healthcare. Others are more interesting: Public and Private companies both focus heavily on Finance and Information Technology, but Public companies also emphasise Pharmaceutical & Biotechnology, while Private companies appear in the Human Resources & Staffing sector where Public companies do not.

4.3 Salary

salary_columns <- c("High.Estimate", "Estimate.Base.Salary", "Low.Estimate")
df_salary <- stack(df[salary_columns])
colnames(df_salary) <- c("amount", "salary")

ggplot(df_salary) +
  geom_boxplot(
    aes(x = salary, y = amount, fill = salary),
    outlier.colour = "red"
  ) +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  custom_theme(title_x_t_margin = 15)

In job descriptions the salary package is often given as a range rather than a fixed number, since there is room for negotiation: the lower end of the range is Low Estimate, the higher end is High Estimate, and Estimate Base Salary is the average of the two.

The box plot shows many outliers with very high salaries but none at the low end. Since all 3 variables have similar distributions, we will use Estimate Base Salary from now on as the most representative measure.
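
As a quick sanity check that Estimate Base Salary really is the midpoint of the range (a sketch; the estimate may be rounded rather than an exact average):

summary(df$Estimate.Base.Salary - (df$Low.Estimate + df$High.Estimate) / 2)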

Let’s find out who the outliers are that offer such high base salary.

df_salary <- df[
  !duplicated(df$Estimate.Base.Salary),
  c(
    "Company",
    "Estimate.Base.Salary",
    "Company.Revenue"
  )
]

df_salary[order(-df_salary$Estimate.Base.Salary), ] %>%
  head(10) %>%
  ggplot() +
    geom_point(aes(
      x = Company.Revenue,
      y = Estimate.Base.Salary,
      colour = Company
    )) +
    labs(title = "Top 10 Highest Paying Jobs") +
    scale_y_continuous(labels = scales::dollar_format()) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    custom_theme(
      title_b_margin = 10,
      title_size = 14,
      label_size = 12,
      font_size = 10
    )

It looks like a few multi-billion-dollar companies are responsible, especially Indeed, which offers some of the highest-paying jobs; this makes sense as they have the revenue to do so. However, the majority of jobs still have a base salary of around $100,000, as shown by the histogram and density plot below.

ggplot(df, aes(x = Estimate.Base.Salary)) +
  geom_histogram(
    aes(y = ..density..),
    binwidth = 8000,
    alpha = 0.8,
    colour = "skyblue",
    fill = "lightblue"
  ) +
  geom_density(colour = "darkred") +
  scale_x_continuous(labels = scales::dollar_format()) +
  theme(axis.text.x = element_text(hjust = 1, vjust = -0.5)) +
  custom_theme(title_y_r_margin = 15, title_x_t_margin = 15)

4.4 Rating

ggplot(df, aes(x = Company.Rating, y = Estimate.Base.Salary)) +
  geom_point(colour = "darkgreen", shape = 23) +
  geom_smooth() +
  scale_y_continuous(labels = scales::dollar_format()) +
  custom_theme()

There is a slight trend of higher-rated companies paying higher base salaries, but the fitted curve is strongly affected by outliers, as there are many 5-star companies that pay very low salaries. This might be because some people care more about things like company culture than money.

df_ratings <- df[rating_columns]

# Replace dots with newlines so that separate words are on
# different lines to fit in the scatter matrix
colnames(df_ratings) <- gsub("\\.", "\n", rating_columns)

pairs(df_ratings)

As suspected, all the other criteria appear to be linearly correlated with Company Rating, so the overall rating is more closely related to these criteria than to salary; i.e. a high salary does not necessarily earn the company a good rating.
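
The visual impression from the scatter matrix can be quantified with a correlation matrix (a sketch; note that the earlier mean imputation slightly dampens these correlations):

round(cor(df[rating_columns]), 2)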

4.5 Skill

javascript_r <- ggplot(
  count(df, javascript_yn, r_yn),
  aes(x = javascript_yn, y = r_yn)
) +
  geom_tile(aes(fill = n)) +
  coord_equal() +
  labs(y = "R", x = "JavaScript") +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5)) +
  custom_theme(font_size = 10, label_size = 12)

python_r <- ggplot(
  count(df, python_yn, r_yn),
  aes(x = python_yn, y = r_yn)
) +
  geom_tile(aes(fill = n)) +
  coord_equal() +
  labs(y = "", x = "Python") +
  custom_theme(font_size = 10, label_size = 12)

grid.arrange(javascript_r, python_r, ncol = 2)

# Count the number of jobs that require both JavaScript and R
sum(df$javascript_yn & df$r_yn)
## [1] 0

Not a single job requires both JavaScript and R, but there is a lot of overlap between Python and R. More jobs ask for R than for JavaScript, and more ask for Python than for R.
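
The Python/R overlap visible in the heat map can be counted with the same logical test (a sketch):

sum(df$python_yn & df$r_yn)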

5 Classification

5.1 Target Variable

Sometimes job postings do not include the salary. A classification model that predicts the salary range of future job postings, based on past job postings with salary information, would be incredibly useful for job hunters who prioritise salary when applying for jobs. As such, the response variable we have chosen is the salary level of the job posting.

The original dataset included three numeric salary variables: High.Estimate, Estimate.Base.Salary and Low.Estimate. As discussed earlier, the salary is usually advertised as a range with room for negotiation, with Estimate.Base.Salary being the average of the two endpoints. Focusing on Estimate.Base.Salary, this becomes a binary classification problem: is the salary high? The cutoff between the two classes is the median Estimate.Base.Salary of $95,000, so salaries below $95,000 are labelled FALSE and salaries of $95,000 or above are labelled TRUE. Since the cutoff is the median, the dataset is almost balanced.

target <- "High.Salary"
df[, target] <- df$Estimate.Base.Salary >= median(df$Estimate.Base.Salary)

paste(
  "There are",
  nrow(df[df[, target], ]),
  "high salary observations. There are",
  nrow(df[!df[, target], ]),
  "low salary observations."
)
## [1] "There are 955 high salary observations. There are 950 low salary observations."

5.2 Feature Variables

Identify the categorical and numerical feature variables.

features <- setdiff(colnames(df), target)

# Convert character to factor
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], as.factor)

categorical_variables <- features[
  sapply(df[, features], class) %in% c("factor", "logical")
]
numerical_variables <- features[
  sapply(df[, features], class) %in% "numeric"
]

paste(
  "There are",
  length(categorical_variables),
  "categorical features,",
  length(numerical_variables),
  "numerical features",
  "and 1 target column."
)
## [1] "There are 37 categorical features, 8 numerical features and 1 target column."

Convert the numerical variables to categorical so they can be used to build single variable models, while keeping the original numerical variables for clustering.

numerical_to_categorical <- c()
for (variable in numerical_variables) {
  categorical_variable <- paste(variable, "categorical", sep = "_")
  numerical_to_categorical <- c(numerical_to_categorical, categorical_variable)
  if (variable == "Friend.Recommendation") { # A rating out of 100
    df[categorical_variable] <- cut(
      df[, variable],
      seq(0, 100, 10)
    )
  } else { # Ratings out of 5
    df[categorical_variable] <- cut(
      df[, variable],
      seq(0, 5, 0.5)
    )
  }
}
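
To verify the binning, the levels of one of the new variables can be tabulated (a sketch; the variable name follows the paste() naming scheme above):

table(df$Company.Rating_categorical, useNA = "ifany")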

5.3 Test and Training Sets

Perform an 80/20 random split of the dataset into a training and a test set.

split <- runif(nrow(df))
training_set <- df[split < 0.8, ]
test_set <- df[split >= 0.8, ]

paste(
  "The training and test set has",
  nrow(training_set),
  "and",
  nrow(test_set),
  "observations respectively."
)
## [1] "The training and test set has 1515 and 390 observations respectively."

5.4 Null Model

The null model always returns the majority category. As mentioned earlier, the dataset is almost balanced, so the model effectively has a 50% chance of predicting High.Salary (TRUE), resulting in an Area Under the Curve (AUC) of 0.5.

calc_auc <- function(pred, ground_truth) {
  round(as.numeric(
    performance(prediction(pred, ground_truth), "auc")@y.values
  ), 4)
}

calc_log_likelihood <- function(pred, ground_truth) {
  # Clamp predictions away from 0 and 1 so log() stays finite,
  # keeping pred aligned element-wise with ground_truth
  pred <- pmin(pmax(pred, 1e-15), 1 - 1e-15)
  round(sum(ifelse(ground_truth, log(pred), log(1 - pred))))
}

null_model <- sum(training_set[target]) / nrow(training_set)

Create a data frame to store the AUC and Log Likelihood of different models.

# Function to calculate the AUC and Log Likelihood then store their values
# to the global data frame which contains the same values for other models
calc_auc_log_likelihood <- function(pred, name, type) {
  auc <- calc_auc(pred, test_set[target])
  log_likelihood <- calc_log_likelihood(pred, test_set[, target])
  print(paste("AUC:", auc))
  print(paste("Log Likelihood:", log_likelihood))

  model_evaluations[nrow(model_evaluations) + 1, ] <- c(
    name,
    type,
    auc,
    log_likelihood
  )
  assign("model_evaluations", model_evaluations, envir = .GlobalEnv)
}

model_evaluations <- data.frame(
  Model.Name = "Null Model",
  Model.Type = "univariate",
  AUC = calc_auc(rep(null_model, nrow(test_set)), test_set[target]),
  Log.Likelihood = calc_log_likelihood(null_model, test_set[, target])
)
kable(model_evaluations)
|Model.Name | Model.Type | AUC| Log.Likelihood|
|:----------|:-----------|---:|--------------:|
|Null Model | univariate | 0.5|           -270|

5.5 Single Variable Model

Single variable model based on categorical variables.

single_variable_prediction <- function(pred_col, output_col, test_col) {
  t <- table(pred_col, output_col)
  pred <- (t[, 2] / (t[, 1] + t[, 2]))[as.character(test_col)]
  pred[is.na(pred)] <- sum(output_col) / length(output_col)
  pred
}
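To make the table lookup concrete, here is a toy walk-through of the same logic on hypothetical data (not from the dataset): each level's proportion of TRUE outcomes in training becomes its predicted probability, and levels unseen in training fall back to the overall TRUE rate.

```r
# Toy data: two levels seen in training, one level ("C") only in the test set
pred_col <- c("A", "A", "B", "B", "B")           # training variable
output_col <- c(TRUE, FALSE, TRUE, TRUE, FALSE)  # training target
test_col <- c("A", "B", "C")

t <- table(pred_col, output_col)
pred <- (t[, 2] / (t[, 1] + t[, 2]))[as.character(test_col)]
pred[is.na(pred)] <- sum(output_col) / length(output_col)
pred  # A: 1/2 TRUE, B: 2/3 TRUE, C: falls back to the overall rate 3/5
```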

cross_validation_100_fold <- function(variable) {
  aucs <- rep(0, 100)
  for (i in seq(aucs)) {
    split <- rbinom(n = nrow(training_set), size = 1, prob = 0.1) == 1
    pred_col <- single_variable_prediction(
      training_set[split, variable],
      training_set[split, target],
      training_set[!split, variable]
    )
    aucs[i] <- calc_auc(pred_col, training_set[!split, target])
  }
  mean(aucs)
}

Find the average AUC for each variable over 100 fold cross validation and save the predicted probabilities back to the data frame so that they can be used in Logistic Regression.

single_variable_models <- data.frame(matrix(ncol = 2, nrow = 0))
colnames(single_variable_models) <- c("Variable", "AUC")
for (variable in c(numerical_to_categorical, categorical_variables)) {
  auc <- cross_validation_100_fold(variable)
  single_variable_models[nrow(single_variable_models) + 1, ] <-
    c(variable, auc)

  training_set[paste(variable, "pred", sep = "_")] <-
    single_variable_prediction(
      training_set[, variable],
      training_set[, target],
      training_set[, variable]
    )

  test_set[paste(variable, "pred", sep = "_")] <-
    single_variable_prediction(
      training_set[, variable],
      training_set[, target],
      test_set[, variable]
    )
}

Select the variables with an average AUC higher than 0.6 as the features used for model training later on.

selected_models <-
  single_variable_models[
    single_variable_models$AUC > 0.6,
  ]
selected_features <- selected_models$Variable

selected_models <-
  selected_models[
    order(selected_models$AUC, decreasing = TRUE),
  ]
row.names(selected_models) <- NULL

kable(selected_models)
|Variable                              |      AUC|
|:-------------------------------------|--------:|
|Company                               | 0.887885|
|Job.Title                             | 0.797329|
|Company.Industry                      | 0.763021|
|Friend.Recommendation_categorical     | 0.656102|
|Job.Location                          | 0.643857|
|Company.Revenue                       | 0.631478|
|Company.Sector                        | 0.628835|
|sql_yn                                | 0.624235|
|Senior.Management_categorical         | 0.624152|
|Compensation.and.Benefits_categorical | 0.618052|
|Culture.and.Values_categorical        | 0.605179|

Pick the top 2 single variable models based on their average AUC: Company and Job.Title, with unusually high values of almost 0.9 and 0.8 respectively. However, this makes perfect sense: within the same company, most data science jobs are very likely to have the same salary, and similarly, the exact same job title usually does not come with a big difference in salary.

company_pred <- single_variable_prediction(
  training_set$Company,
  training_set[, target],
  test_set$Company
)
job_title_pred <- single_variable_prediction(
  training_set$Job.Title,
  training_set[, target],
  test_set$Job.Title
)

# Calculate their AUC and Log Likelihood using the test set.
calc_auc_log_likelihood(company_pred, "Company", "univariate")
## [1] "AUC: 0.9561"
## [1] "Log Likelihood: -401"
calc_auc_log_likelihood(job_title_pred, "Job Title", "univariate")
## [1] "AUC: 0.904"
## [1] "Log Likelihood: -346"

The metrics measured on the test set are higher than on the validation set because the latter was the average across 100 folds.

Plot their predicted probabilities next to each other.

double_density_plot <- function(
  pred_col,
  output_col,
  x,
  y
) {
  ggplot(data.frame(
    pred = pred_col,
    High.Salary = output_col
  )) +
    geom_density(aes(x = pred, colour = High.Salary)) +
    labs(x = paste("Predicted Probability of", x), y = y)
}

grid.arrange(
  double_density_plot(
    company_pred,
    test_set[target],
    "Company",
    "Density"
  ),
  double_density_plot(
    job_title_pred,
    test_set[target],
    "Job Title",
    ""
  ),
  ncol = 2
)

Compare the ROC curves by plotting them on top of each other.

roc_plot <- function(pred_col, out_col, colour = "red", overlaid = FALSE) {
  par(new = overlaid)
  plot(
    rocit(score = pred_col, class = out_col),
    col = c(colour, "black"),
    legend = FALSE,
    YIndex = FALSE
  )
}

roc_plot(company_pred, test_set[, target], "red")
roc_plot(job_title_pred, test_set[, target], "blue", TRUE)
legend(
  "bottomright",
  col = c("red", "blue"),
  c("Company", "Job Title"),
  lwd = 2
)

As shown above, using the AUC metric Company performs moderately better than Job.Title; however, using the Log Likelihood metric, Job.Title is better. This means Company has a higher performance averaged across all possible decision thresholds, while Job.Title gives a higher certainty in its predictions, as Log Likelihood measures how close the predicted probabilities are to the ground truth (0 or 1).
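To make this trade-off concrete, here is a small sketch with hypothetical probabilities (not from the dataset): a model can rank perfectly (AUC = 1) yet score a worse Log Likelihood than a slightly mis-ranked but more confident model.

```r
# Hypothetical probabilities illustrating the AUC vs Log Likelihood trade-off:
# a model can rank perfectly (AUC = 1) yet be penalised by Log Likelihood for
# being under-confident, while a confident model with one ranking mistake
# scores a better Log Likelihood despite a lower AUC.
truth <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
cautious <- c(0.6, 0.6, 0.6, 0.4, 0.4, 0.4)        # perfect ranking, timid
confident <- c(0.95, 0.95, 0.35, 0.45, 0.05, 0.05) # one mistake, confident

# AUC via the rank (Mann-Whitney) formulation, base R only
auc <- function(pred, truth) {
  r <- rank(pred)
  n_pos <- sum(truth)
  n_neg <- sum(!truth)
  (sum(r[truth]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
log_lik <- function(pred, truth) sum(ifelse(truth, log(pred), log(1 - pred)))

auc(cautious, truth)        # 1: every TRUE ranked above every FALSE
auc(confident, truth)       # below 1: one TRUE ranked under a FALSE
log_lik(cautious, truth)    # more negative despite the perfect ranking
log_lik(confident, truth)   # better Log Likelihood despite the lower AUC
```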

5.6 Model Evaluation

Functions to call on a model to make predictions, calculate the AUC and Log Likelihood, and evaluate which features play a key role in determining whether a job posting offers a high or low salary.

# Function to print the confusion matrix as well as calculating
# the precision and recall
confusion_matrix_accuracy <- function(model, features) {
  pred <- as.logical(predict(
    model,
    test_set[features]
  ))

  confusion_matrix <- table(
    ifelse(test_set[, target], "High Salary", "Low Salary"),
    pred
  )[, 2:1]

  print(paste(
    "Precision:",
    format(confusion_matrix[1, 1] / sum(confusion_matrix[, 1]), digits = 3)
  ))

  print(paste(
    "Recall:",
    format(confusion_matrix[1, 1] / sum(confusion_matrix[1, ]), digits = 3)
  ))

  print(kable(confusion_matrix))
  pred
}

# Function to calculate the AUC and Log Likelihood as well as generating a
# double density plot
evaluate_model <- function(model, features, name) {
  pred <- predict(
    model,
    test_set[features],
    "prob"
  )[2]
  pred <- unlist(pred)

  calc_auc_log_likelihood(pred, name, "multivariate")

  plot(double_density_plot(
    pred,
    test_set[target],
    name,
    "Density"
  ))
  pred
}

# Function to plot the explanation of how the individual features
# support or contradict a prediction
lime_plot <- function(model, features, pred) {
  # Pick 4 examples for LIME to explain
  test_cases <- c()

  # True Positive
  for (i in seq(length(pred))) {
    if (test_set[i, target] && pred[i]) {
      test_cases <- c(test_cases, i)
      break
    }
  }

  # False Negative
  for (i in seq(length(pred))) {
    if (test_set[i, target] && !pred[i]) {
      test_cases <- c(test_cases, i)
      break
    }
  }

  # False Positive
  for (i in seq(length(pred))) {
    if (!test_set[i, target] && pred[i]) {
      test_cases <- c(test_cases, i)
      break
    }
  }

  # True Negative
  for (i in seq(length(pred))) {
    if (!test_set[i, target] && !pred[i]) {
      test_cases <- c(test_cases, i)
      break
    }
  }

  example <- test_set[test_cases, features]
  explainer <- lime(
    training_set[features],
    model = model,
    bin_continuous = TRUE,
    n_bins = 10
  )

  explanation <- explain(
    example,
    explainer,
    n_labels = 1,
    n_features = length(features)
  )
  plot_features(explanation)
}

The LIME plot will explain 4 test cases, top left is a True Positive instance, top right is a False Negative instance, bottom left is a False Positive instance and bottom right is a True Negative instance.
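As an aside, the four search loops in lime_plot() could be condensed with which(), which returns the indices of the TRUE entries; a sketch with toy vectors standing in for test_set[, target] and pred:

```r
# which() returns the indices of the TRUE entries, so the first instance of
# each confusion-matrix cell can be picked directly instead of looping.
# Toy vectors stand in for test_set[, target] and pred.
truth <- c(TRUE, FALSE, TRUE, FALSE, TRUE)
pred <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

test_cases <- c(
  which(truth & pred)[1],   # True Positive
  which(truth & !pred)[1],  # False Negative
  which(!truth & pred)[1],  # False Positive
  which(!truth & !pred)[1]  # True Negative
)
test_cases  # 1 3 2 4
```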

5.7 Naive Bayes

The Naive Bayes classifier works well on categorical variables, which is why it is chosen for this problem: there are many categorical variables with lots of levels.

naive_bayes <- caret::train(
  x = training_set[selected_features],
  y = as.factor(training_set[, target]),
  method = "nb"
)

Naive Bayes has an AUC of 0.9667 and a Log Likelihood of -116, which outperforms the best single variable model, Company, with an AUC of 0.9561 and a Log Likelihood of -401. This means Naive Bayes is doing a great job of combining the top performing variables to improve the predictions even further.

naive_bayes_pred <- evaluate_model(
  naive_bayes,
  selected_features,
  "Naive Bayes"
)
## [1] "AUC: 0.9667"
## [1] "Log Likelihood: -116"

The number of False Positive and False Negative cases is the same, which results in identical precision and recall.

pred <- confusion_matrix_accuracy(naive_bayes, selected_features)
## [1] "Precision: 0.915"
## [1] "Recall: 0.915"
## 
## 
## |            | TRUE| FALSE|
## |:-----------|----:|-----:|
## |High Salary |  182|    17|
## |Low Salary  |   17|   174|

Except for the False Negative case shown by the top right LIME plot, the No.1 determining feature is Company, the top performing single variable model, in 2 of the cases, and Job.Title, the second best variable, in the remaining one.

lime_plot(naive_bayes, selected_features, pred)

5.8 Logistic Regression

Logistic Regression can be used to classify a dependent variable based on one or more independent features. It finds the best fitting model to describe the relationship between the dependent and the independent variables. As this is a binary classification task, the binomial family is used.

Because every categorical variable has to be expanded into a set of indicator variables in Logistic Regression, it does not work well with a large number of levels. Therefore, instead of using the original data, whose categorical variables contain many levels, the predicted probabilities from the single variable models will be used.
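A toy sketch of the indicator expansion (the factor here is hypothetical, not from the dataset): model.matrix() turns every level beyond the reference level into its own column, so a variable like Company with hundreds of levels would add hundreds of coefficients.

```r
# Three levels -> intercept plus two indicator columns; the first level
# (alphabetically, "Melbourne") becomes the reference level
city <- factor(c("Sydney", "Melbourne", "Perth", "Sydney"))
mm <- model.matrix(~ city)
colnames(mm)  # "(Intercept)" "cityPerth" "citySydney"
dim(mm)       # 4 rows, 3 columns
```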

probability_columns <- paste(selected_features, "pred", sep = "_")
logistic_regression <- caret::train(
  x = training_set[probability_columns],
  y = as.factor(training_set[, target]),
  method = "glm",
  family = binomial(link = "logit")
)

Logistic Regression has an AUC of 0.9672 and a Log Likelihood of -120 which is pretty much the same as Naive Bayes. However the difference between them is in the precision and recall.

logistic_regression_pred <- evaluate_model(
  logistic_regression,
  probability_columns,
  "Logistic Regression"
)
## [1] "AUC: 0.9672"
## [1] "Log Likelihood: -120"

Logistic Regression has a lower precision but a higher recall than Naive Bayes which means it is able to identify more of the high salary jobs than Naive Bayes while making more mistakes predicting low salary jobs as high salary.

pred <- confusion_matrix_accuracy(logistic_regression, probability_columns)
## [1] "Precision: 0.898"
## [1] "Recall: 0.93"
## 
## 
## |            | TRUE| FALSE|
## |:-----------|----:|-----:|
## |High Salary |  185|    14|
## |Low Salary  |   21|   170|

Once again, except for the False Negative case shown by the top right LIME plot, the predicted probability from the Company single variable model is the most supportive feature, which is also the top performing variable.

lime_plot(logistic_regression, probability_columns, pred)

The top 2 performing single variable models, Company and Job.Title, are also indicated as the 2 most significant variables under the Coefficients part of the summary, as their p-values are much smaller than the others.

summary(logistic_regression)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.5452  -0.0505  -0.0003   0.0366   4.1732  
## 
## Coefficients:
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                                -8.97379    1.99969  -4.488  7.2e-06
## Friend.Recommendation_categorical_pred     -4.73071    1.87729  -2.520 0.011737
## Compensation.and.Benefits_categorical_pred -1.94267    2.24923  -0.864 0.387749
## Culture.and.Values_categorical_pred        -0.06096    2.83143  -0.022 0.982824
## Senior.Management_categorical_pred          3.73655    2.49100   1.500 0.133609
## Job.Title_pred                              8.45650    0.89265   9.474  < 2e-16
## Job.Location_pred                          10.56725    2.25561   4.685  2.8e-06
## Company_pred                               11.59918    1.35439   8.564  < 2e-16
## Company.Sector_pred                         4.42798    1.56416   2.831 0.004642
## Company.Industry_pred                      -2.77013    1.26330  -2.193 0.028324
## Company.Revenue_pred                       -4.83367    2.13271  -2.266 0.023424
## sql_yn_pred                                -6.01158    1.65232  -3.638 0.000274
##                                               
## (Intercept)                                ***
## Friend.Recommendation_categorical_pred     *  
## Compensation.and.Benefits_categorical_pred    
## Culture.and.Values_categorical_pred           
## Senior.Management_categorical_pred            
## Job.Title_pred                             ***
## Job.Location_pred                          ***
## Company_pred                               ***
## Company.Sector_pred                        ** 
## Company.Industry_pred                      *  
## Company.Revenue_pred                       *  
## sql_yn_pred                                ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2100.23  on 1514  degrees of freedom
## Residual deviance:  228.73  on 1503  degrees of freedom
## AIC: 252.73
## 
## Number of Fisher Scoring iterations: 9

5.9 Comparison

A couple of single variable models perform even worse than the null model, which has an AUC of 0.5 on this balanced dataset. However, the majority of them perform quite well, and only those with AUCs above 0.6 will be used as features for training the classification models, so the low performing features will not affect the performance of the models.

paste0(
  round(
    nrow(single_variable_models[single_variable_models$AUC > 0.5, ]) /
    nrow(single_variable_models) *
    100
  ),
  "% of the single variable models perform better than the null model."
)
## [1] "87% of the single variable models perform better than the null model."
single_variable_models <-
  single_variable_models[
    order(single_variable_models$AUC),
  ]
row.names(single_variable_models) <- NULL
kable(head(single_variable_models))
|Variable     |      AUC|
|:------------|--------:|
|git_yn       | 0.498739|
|spark_yn     | 0.499552|
|nosql_yn     | 0.499916|
|sas_yn       | 0.499954|
|cassandra_yn | 0.500000|
|bigml_yn     | 0.500000|

paste(
  nrow(selected_models),
  "features with an AUC above 0.6 are selected out of the",
  nrow(single_variable_models),
  "features."
)
## [1] "11 features with an AUC above 0.6 are selected out of the 45 features."

Even though the good single variable models already perform quite well on this dataset, thanks to the large number of levels they contain and a reasonably easy binary classification task, both Naive Bayes and Logistic Regression can still take advantage of the high performing variables to make even better predictions. The two have quite similar performance, but Logistic Regression has to be trained on the probabilities from the single variable models because it cannot handle categorical variables with many levels.

kable(model_evaluations)
|Model.Name          | Model.Type   |    AUC| Log.Likelihood|
|:-------------------|:-------------|------:|--------------:|
|Null Model          | univariate   | 0.5000|           -270|
|Company             | univariate   | 0.9561|           -401|
|Job Title           | univariate   | 0.9040|           -346|
|Naive Bayes         | multivariate | 0.9667|           -116|
|Logistic Regression | multivariate | 0.9672|           -120|

roc_plot(naive_bayes_pred, test_set[, target], "red")
roc_plot(logistic_regression_pred, test_set[, target], "blue", TRUE)

legend(
  "bottomright",
  col = c("red", "blue"),
  c("Naive Bayes", "Logistic Regression"),
  lwd = 2
)

6 Clustering

The goal of clustering is to discover similarities among subsets of the data. Given that a binary classification was previously performed, it will be interesting to see whether the dataset forms clusters around the salary of job postings. As clustering techniques work better on numerical variables, first extract them and apply scaling. Since this dataset contains mostly categorical variables, the only numerical variables are the salaries and ratings.

clustering_df <- scale(
  df[, colnames(df[sapply(df, class) %in% c("numeric", "integer")])]
)
head(clustering_df)
##   Estimate.Base.Salary Low.Estimate High.Estimate Company.Rating
## 1           -0.1547827   -0.3187081   0.005765876     0.02342043
## 2           -0.1355627   -0.3533227   0.080998705     0.24390174
## 3           -0.1355627   -0.3533227   0.080998705     0.24390174
## 4           -0.1355627   -0.3533227   0.080998705     0.24390174
## 5            0.4391111    0.1658954   0.707938948     0.24390174
## 6            0.4391111    0.1658954   0.707938948     0.46438305
##   Friend.Recommendation Number.of.Rater Career.Opportunities
## 1             0.0589156     -0.30021124            0.4557051
## 2             0.2501755     -0.12520641            0.1226472
## 3             0.2501755     -0.12520641            0.1226472
## 4             0.2501755     -0.12520641            0.1226472
## 5             0.4031834      0.12812981            0.4557051
## 6             0.5944433     -0.03220137            0.4557051
##   Compensation.and.Benefits Culture.and.Values Senior.Management
## 1                 0.3274500          0.4788573         0.4878952
## 2                 0.5568899          0.1588735         0.2547928
## 3                 0.5568899          0.1588735         0.2547928
## 4                 0.5568899          0.1588735         0.2547928
## 5                 0.3274500          0.4788573         0.4878952
## 6                 0.3274500          0.4788573         0.4878952
##   Work.Life.Balance
## 1         0.4482363
## 2         0.1260349
## 3         0.1260349
## 4         0.1260349
## 5         0.4482363
## 6         0.4482363

6.1 Hierarchical Clustering

Hierarchical Clustering is chosen over kMeans as there is no clear number of clusters expected to be formed from this dataset. The focus is more on exploring possible partitions in the data.

hc <- hclust(dist(clustering_df, method = "euclidean"), method = "ward.D2")

Plot the dendrogram. It looks like the majority of the data tends to form one big cluster; even though 2 or 3 clusters would be highly stable, that big cluster should be divided up as much as possible. Plotting horizontal lines helps visualise the stability of different numbers of clusters; 7 or 8 seem to be the largest choices that are still stable.

hcd <- as.dendrogram(hc)
plot(hcd, ylab = "Height", leaflab = "none", horiz = FALSE)
abline(h = 54, col = "red", lty = 2, lwd = 1.3)
abline(h = 51.4, col = "red", lty = 2, lwd = 1.3)
abline(h = 45.9, col = "red", lty = 2, lwd = 1.3)
abline(h = 42.91, col = "red", lty = 2, lwd = 1.3)
abline(h = 34.6, col = "red", lty = 2, lwd = 1.3)
abline(h = 26.5, col = "red", lty = 2, lwd = 1.3)
text(5, c(52.8, 48.5, 44.4, 39, 30), paste("k =", 4:8), col = "blue", cex = 0.7)

Cut the dendrogram at different heights to return the cluster sizes. Show the distribution of observations for 1 to 9 clusters.

max_k <- 9
# Number of observations in each class
xtabs(~cluster + max_clust, as.data.frame(cutree(hc, 1:max_k)) %>%
  pivot_longer(
    cols = 1:max_k,
    names_to = "max_clust",
    values_to = "cluster"
  )
)
##        max_clust
## cluster    1    2    3    4    5    6    7    8    9
##       1 1905 1768 1368  986  986  986  898  377  377
##       2    0  137  400  382  194  194  194  194  194
##       3    0    0  137  400  400   50   50   50   50
##       4    0    0    0  137  188  188  188  188  168
##       5    0    0    0    0  137  137  137  521  521
##       6    0    0    0    0    0  350  350  137  137
##       7    0    0    0    0    0    0   88  350  350
##       8    0    0    0    0    0    0    0   88   20
##       9    0    0    0    0    0    0    0    0   88

This table indicates the distribution of observations at cluster levels 1 to 9. The dataset has 1905 observations, which split into 1768 and 137 at a cluster level of 2. At the third level the dataset is split into 1368, 400 and 137 observations. It is noticeable that the majority of the data shows similar characteristics, except for the 137 observations, which are significantly different as they always remain in a cluster by themselves throughout the 9 levels.

6.2 Optimal Number of Clusters

To determine the optimal number of clusters for the dataset, the total Within Sum of Squares (WSS) and the Calinski-Harabasz index (CH Index) are measured. An optimal number of clusters is one where the WSS is minimised and the CH Index is maximised, meaning there is limited variance within clusters and large variance between clusters.

# Function to calculate the WSS of a cluster
wss <- function(cluster) {
  c0 <- colMeans(cluster)
  sum(apply(cluster, 1, function(row) sum((row - c0) ^ 2)))
}

# Function to calculate the total WSS
wss_total <- function(df, labels) {
  total <- 0
  for (i in seq(unique(labels))) {
    total <- total + wss(subset(df, labels == i))
  }
  total
}

# Function to calculate the CH indices computed using hierarchical clustering
ch_index <- function(df, max_k, hc) {
  npts <- nrow(df)
  wss_values <- numeric(max_k) # create a vector of numeric type

  # wss_values[1] stores the WSS value for k = 1
  # when all the data points form 1 large cluster
  wss_values[1] <- wss(df)

  for (k in 2:max_k) {
    labels <- cutree(hc, k)
    wss_values[k] <- wss_total(df, labels)
  }

  # CH Index = B / W
  b <- (wss(df) - wss_values) / (0:(max_k - 1))
  w <- wss_values / (npts - seq(max_k))
  data.frame(k = seq(max_k), CH.index = b / w, WSS = wss_values)
}

Plot the CH Index and WSS across different k values from 1 to 9 to visualise the optimal number of clusters.

ch_criterion <- ch_index(clustering_df, max_k, hc)
grid.arrange(
  ggplot(ch_criterion, aes(x = k, y = CH.index)) +
    geom_point() +
    geom_line(colour = "red") +
    scale_x_continuous(breaks = 1:max_k, labels = 1:max_k) +
    labs(y = "CH index"),
  ggplot(ch_criterion, aes(x = k, y = WSS), color = "blue") +
    geom_point() + geom_line(colour = "blue") +
    scale_x_continuous(breaks = 1:max_k, labels = 1:max_k),
  ncol = 2
)

The CH criterion is maximised at k = 2 with another local maximum at k = 8, whereas the WSS keeps decreasing and reaches its minimum at k = 9. Therefore 8 clusters can be considered a reasonable estimate of the optimal number, maximising the distance between clusters while keeping the variability within clusters low. This also reinforces the earlier conclusion that 8 clusters is one of the largest stable choices. Plot the dendrogram again, this time with the 8 clusters, to visualise it.

optimal_k <- 8
plot(hcd, ylab = "Height", leaflab = "none", horiz = FALSE)
rect.hclust(hc, k = optimal_k)

6.3 Validating Clusters

Use PCA to project the data into 2D so that the distribution of clusters can be visualised through plotting the convex hulls.

pca <- prcomp(clustering_df)
project_2d <- as.data.frame(predict(pca, newdata = clustering_df)[, c(1, 2)])

find_convex_hull <- function(project_2d, clusters) {
  do.call(rbind,
    lapply(
      unique(clusters),
      function(c) {
        f <- subset(project_2d, cluster == c)
        f[chull(f), ]
      }
    )
  )
}

fig <- c()
for (k in seq(2, optimal_k + 1)) {
  clusters <- cutree(hc, k)
  project_2d_df <- cbind(
    project_2d,
    cluster = as.factor(clusters),
    salary = df$Estimate.Base.Salary
  )
  convex_hull <- find_convex_hull(project_2d_df, clusters)
  assign(paste0("k", k),
    ggplot(project_2d_df, aes(x = PC1, y = PC2)) +
      geom_point(aes(shape = cluster, color = cluster, alpha = 0.1)) +
      geom_polygon(data = convex_hull, aes(group = cluster, fill = cluster),
      alpha = 0.4, linetype = 0) +
      labs(title = sprintf("k = %d", k)) +
      theme(legend.position = "none")
  )
}

grid.arrange(k2, k3, k4, k5, k6, k7, k8, k9, ncol = 4)

As evident in the figure above, the data first splits into 2 clusters. The large one on the left demonstrates significant variability, as increasing the number of clusters always divides it into smaller clusters. On the other hand, the small one on the right has never changed since the first split.

# Find out how stable the clusters are
clusterboot_hclust <- clusterboot(
  clustering_df,
  clustermethod = hclustCBI,
  method = "ward.D2",
  k = optimal_k
)

Except for 1 cluster with a stability of 0.43, every other cluster is reasonably stable, with most values close to 1. This matches the cluster distribution above for k = 2 to 9, where that 1 big cluster is formed at the beginning and, as the number of clusters increases, dissolves into smaller clusters while the others stay the same once formed.

kable(data.frame(
  Cluster = seq(clusterboot_hclust$bootbrd),
  Stability = 1 - clusterboot_hclust$bootbrd / 100
))
| Cluster| Stability|
|-------:|---------:|
|       1|      0.43|
|       2|      0.88|
|       3|      0.71|
|       4|      1.00|
|       5|      0.74|
|       6|      1.00|
|       7|      0.70|
|       8|      0.66|

6.4 Exploring Clusters

Now that the optimal number of clusters is found, use the 8 clusters to explore the patterns they represent in the data.

# Append the cluster number to the original dataset
df$Cluster <- as.factor(cutree(hc, optimal_k))

6.4.1 Job Locations

Plot a filled bar chart to investigate whether each cluster tends to be in the same state.

ggplot(df) +
  geom_bar(
    aes(x = Cluster, fill = State),
    alpha = 0.7,
    position = "fill"
  )

Unfortunately, there is no geographical pattern to be found in the clusters, as all of the states have some percentage of data in each cluster and are quite evenly spread out.

6.4.2 Salary

Plot a histogram to investigate whether the clusters form around different levels of salary.

ggplot(df) +
  geom_histogram(
    aes(x = Estimate.Base.Salary, fill = Cluster),
    binwidth = 8000,
    alpha = 0.7
  ) +
  scale_x_continuous(
    labels = scales::dollar_format(),
    breaks = seq(0, 300000, 20000)
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.title.x = element_text(margin = margin(t = 10))
  )

Cluster 5 is the largest cluster and covers the low end salaries below $110,000. Cluster 7 predominantly covers salaries in the range of $120,000 to $170,000. Cluster 3 covers the high end salaries over $180,000. The rest of the clusters overlap in the range from $60,000 to $130,000 but for the most part the figure demonstrates a separation of the data into “low”, “medium”, and “high” salaries.

6.4.3 Rating

Plot a scatter plot to investigate whether the company ratings have any relationships with the clusters formed.

ggplot(df) +
  geom_point(
    aes(
      x = Company.Rating,
      y = Estimate.Base.Salary,
      colour = Cluster
    ),
    alpha = 0.7
  ) +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme(axis.title.y = element_text(margin = margin(r = 5)))

There is a much clearer separation between the clusters here. Cluster 6 contains all the low ratings of 1. Cluster 8 covers ratings from 1.5 to 3. Cluster 5 sits between ratings 3 and 4 at the lower end of the salaries, while cluster 7 covers the higher end of the same rating range; clusters 1 and 4 overlap with each other, both centred around medium salaries. Cluster 2 is situated at high ratings above 4.5. Cluster 3, the smallest cluster, has all the very high salaries with ratings between 4 and 5.

This separation comes from the fact that the only numerical variables in this dataset are the salaries and ratings so they are the determining factors in the clustering process.